Dataset Analysis

Alpaca Dataset Comprehensive Quality Analysis

2025-12-07 13:30:29 52002 conversations 156006 messages

Overview

Dataset: tatsu-lab/alpaca
Conversations: 52002
Analyzed: 52002
Coverage: 100.0%
Messages: 156006
Analyzers: content_pattern, length, quality, diversity, training_quality

Recommendations

27 issues
high Truncated responses detected
Found 7958 assistant responses (15.3%) that appear to be truncated (ending mid-sentence or with incomplete punctuation). Training on truncated responses may cause the model to generate incomplete outputs. Consider completing or removing these samples.
7958 samples text_content_training_quality_has_proper_ending
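
The report does not show the rule behind `has_proper_ending`. The following is a minimal sketch of one plausible heuristic, assuming a response counts as properly ended when its last non-whitespace character is terminal punctuation or a closing quote or bracket. The function name `looks_truncated` and the character set are illustrative, not the analyzer's own.

```python
# Hypothetical heuristic: characters that may legitimately end a response.
TERMINAL_CHARS = (".", "!", "?", '"', "'", ")", "]", "}", "`")


def looks_truncated(response: str) -> bool:
    """Return True when a non-empty response lacks a terminal character."""
    text = response.rstrip()
    if not text:
        # Empty responses are reported as a separate issue, so skip them here.
        return False
    return not text.endswith(TERMINAL_CHARS)


examples = ["The capital of France is", "The capital of France is Paris.", ""]
flags = [looks_truncated(e) for e in examples]
```

A rule this simple also flags lists, code snippets and numeric answers that end without punctuation, so spot-checking flagged samples before removing them is advisable.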
medium Outliers detected in diversity vocabulary richness
Found 1612 samples (1.0%) with values outside 3.0 standard deviations from the mean. High outliers: 1612, Low outliers: 0. Consider reviewing these samples for potential data quality issues.
1612 samples text_content_diversity_vocabulary_richness
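
Several findings in this list apply the same rule: flag values more than 3.0 standard deviations from the mean. A minimal sketch of that rule follows, assuming the population standard deviation is used (the report does not say which estimator the analyzer applies). The function name `zscore_outliers` is illustrative.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (high, low) index lists for values beyond `threshold` std from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [], []  # constant data has no outliers
    high = [i for i, v in enumerate(values) if (v - mu) / sigma > threshold]
    low = [i for i, v in enumerate(values) if (v - mu) / sigma < -threshold]
    return high, low


# Twenty typical values and one extreme value at index 20.
high, low = zscore_outliers([1.0] * 20 + [100.0])
```

For a metric that cannot go below zero, the lower cutoff may be unreachable. With the vocabulary richness statistics in the Message Statistics table (mean 3.87, std 1.29, min 0.0), the lower cutoff is 3.87 - 3 x 1.29 = 0.0, which is consistent with the report finding no low outliers.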
medium Inconsistent instruction formatting detected
Found multiple instruction format patterns in the dataset: alpaca: 20679, vicuna: 34. Mixing formats may confuse the model and reduce training effectiveness. Consider standardizing to a single format.
20713 samples text_content
low Outliers detected in content pattern placeholder count
Found 286 samples (0.2%) with values outside 3.0 standard deviations from the mean. High outliers: 286, Low outliers: 0. Consider reviewing these samples for potential data quality issues.
286 samples text_content_content_pattern_placeholder_count
low Multimodal distribution in length char count
Detected 5 distinct modes in the distribution (confidence: 60%). Mode 1: 98346 samples (61.6%), mean=76.51, std=31.09; Mode 3: 31070 samples (20.7%), mean=158.35, std=13.39; Mode 2: 19896 samples (12.6%), mean=371.49, std=100.23; Mode 5: 5670 samples (4.1%), mean=697.54, std=107.36; Mode 4: 1024 samples (1.0%), mean=1338.58, std=382.77. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
156006 samples text_content_length_char_count
low Outliers detected in length char count
Found 887 samples (0.6%) that are outliers within their respective modes. Distribution has 5 modes (mode 0: μ=76.51, σ=31.09, mode 2: μ=158.35, σ=13.39, mode 1: μ=371.49, σ=100.23, mode 4: μ=697.54, σ=107.36, mode 3: μ=1338.58, σ=382.77). Outliers are samples more than 3.0 std from mode mean.
887 samples text_content_length_char_count
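
The per-mode outlier rule can be sketched as follows. This is an illustration under stated assumptions, not the analyzer's implementation: the report does not say how a sample is assigned to a mode, so the sketch assigns each value to the mode with the smallest z-score and flags it when even that z-score exceeds 3.0. The mode means and standard deviations are copied from the finding above; the name `is_mode_outlier` is hypothetical.

```python
# (mean, std) for modes 0..4 of the char count distribution, from the report.
CHAR_COUNT_MODES = [
    (76.51, 31.09),
    (158.35, 13.39),
    (371.49, 100.23),
    (697.54, 107.36),
    (1338.58, 382.77),
]


def is_mode_outlier(value, modes=CHAR_COUNT_MODES, threshold=3.0):
    """Flag a value that is more than `threshold` std from every mode mean."""
    # Assumption: a sample belongs to the mode that gives it the smallest z-score.
    z_scores = [abs(value - mu) / sigma for mu, sigma in modes]
    return min(z_scores) > threshold


typical = is_mode_outlier(100)    # close to the first mode
extreme = is_mode_outlier(4181)   # the maximum char count in Message Statistics
```

Under this rule a 100-character message is not an outlier, while the 4181-character maximum lies more than 7 standard deviations from the nearest mode mean.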
low Multimodal distribution in length word count
Detected 4 distinct modes in the distribution (confidence: 43%). Mode 1: 44369 samples (31.3%), mean=7.83, std=3.03; Mode 4: 84998 samples (50.1%), mean=19.32, std=4.68; Mode 2: 24750 samples (16.6%), mean=69.47, std=24.14; Mode 3: 1889 samples (2.0%), mean=187.98, std=59.21. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
156006 samples text_content_length_word_count
low Outliers detected in length word count
Found 478 samples (0.3%) that are outliers within their respective modes. Distribution has 4 modes (mode 0: μ=7.83, σ=3.03, mode 3: μ=19.32, σ=4.68, mode 1: μ=69.47, σ=24.14, mode 2: μ=187.98, σ=59.21). Outliers are samples more than 3.0 std from mode mean.
478 samples text_content_length_word_count
low Multimodal distribution in length token count
Detected 5 distinct modes in the distribution (confidence: 57%). Mode 1: 92322 samples (55.9%), mean=14.33, std=5.04; Mode 3: 34284 samples (24.4%), mean=28.0, std=2.94; Mode 4: 23431 samples (14.5%), mean=70.98, std=21.11; Mode 2: 4657 samples (3.9%), mean=138.86, std=19.54; Mode 5: 1312 samples (1.2%), mean=257.67, std=77.4. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
156006 samples text_content_length_token_count
low Outliers detected in length token count
Found 1135 samples (0.7%) that are outliers within their respective modes. Distribution has 5 modes (mode 0: μ=14.33, σ=5.04, mode 2: μ=28.0, σ=2.94, mode 3: μ=70.98, σ=21.11, mode 1: μ=138.86, σ=19.54, mode 4: μ=257.67, σ=77.4). Outliers are samples more than 3.0 std from mode mean.
1135 samples text_content_length_token_count
low Outliers detected in quality pii count
Found 183 samples (0.1%) with values outside 3.0 standard deviations from the mean. High outliers: 183, Low outliers: 0. Consider reviewing these samples for potential data quality issues.
183 samples text_content_quality_pii_count
low Outliers detected in diversity unique words ratio
Found 1211 samples (0.8%) with values outside 3.0 standard deviations from the mean. High outliers: 0, Low outliers: 1211. Consider reviewing these samples for potential data quality issues.
1211 samples text_content_diversity_unique_words_ratio
low Outliers detected in diversity type token ratio
Found 1211 samples (0.8%) with values outside 3.0 standard deviations from the mean. High outliers: 0, Low outliers: 1211. Consider reviewing these samples for potential data quality issues.
1211 samples text_content_diversity_type_token_ratio
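
The unique words ratio and type token ratio findings report identical counts, which suggests (though the report does not confirm) that both measure the share of distinct words in a message. A minimal sketch of that quantity, assuming lowercase whitespace tokenization; the analyzer's actual tokenizer is not shown, and the function name is illustrative.

```python
def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for an empty message."""
    words = text.lower().split()  # assumption: whitespace tokenization
    if not words:
        return 0.0
    return len(set(words)) / len(words)


ratio = type_token_ratio("the cat and the dog")  # 4 distinct words out of 5
```

Low outliers on this metric are messages that repeat the same words many times, which is why only the low side is populated in the findings above.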
low Multimodal distribution in training quality instruction word count
Detected 5 distinct modes in the distribution (confidence: 46%). Mode 1: 30488 samples (53.0%), mean=9.29, std=2.35; Mode 3: 13901 samples (31.0%), mean=16.87, std=2.24; Mode 4: 6029 samples (12.4%), mean=26.79, std=4.13; Mode 2: 1390 samples (3.0%), mean=50.38, std=10.58; Mode 5: 194 samples (0.6%), mean=104.11, std=39.85. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
52002 samples text_content_training_quality_instruction_word_count
low Outliers detected in training quality instruction word count
Found 5 samples (0.0%) that are outliers within their respective modes. Distribution has 5 modes (mode 0: μ=9.29, σ=2.35, mode 2: μ=16.87, σ=2.24, mode 3: μ=26.79, σ=4.13, mode 1: μ=50.38, σ=10.58, mode 4: μ=104.11, σ=39.85). Outliers are samples more than 3.0 std from mode mean.
5 samples text_content_training_quality_instruction_word_count
low Multimodal distribution in training quality response word count
Detected 5 distinct modes in the distribution (confidence: 60%). Mode 2: 21640 samples (39.5%), mean=8.22, std=5.1; Mode 4: 14893 samples (30.0%), mean=39.77, std=12.61; Mode 1: 11777 samples (21.1%), mean=81.28, std=13.12; Mode 5: 2809 samples (6.8%), mean=130.69, std=16.7; Mode 3: 883 samples (2.6%), mean=229.97, std=62.52. This may indicate different types of content (e.g., short questions vs long explanations). Outlier detection uses per-mode statistics to avoid false positives.
52002 samples text_content_training_quality_response_word_count
low Outliers detected in training quality response word count
Found 20 samples (0.0%) that are outliers within their respective modes. Distribution has 5 modes (mode 1: μ=8.22, σ=5.1, mode 3: μ=39.77, σ=12.61, mode 0: μ=81.28, σ=13.12, mode 4: μ=130.69, σ=16.7, mode 2: μ=229.97, σ=62.52). Outliers are samples more than 3.0 std from mode mean.
20 samples text_content_training_quality_response_word_count
low Empty or near-empty messages detected
Found 1188 messages (0.8%) with 5 or fewer characters. These may indicate data quality issues or placeholder content that should be reviewed.
1188 samples text_content
low Many short messages detected
Found 29201 messages (18.7%) with fewer than 10 words. This may be intentional (e.g., short responses) or indicate low-quality samples worth reviewing.
29201 samples text_content_length_word_count
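
The two length findings above use fixed thresholds: 5 or fewer characters, and fewer than 10 words. A minimal sketch of both checks, assuming surrounding whitespace is stripped before counting and words are split on whitespace (neither detail is stated in the report). The function name `length_flags` is illustrative.

```python
def length_flags(message: str) -> dict:
    """Apply the report's two length thresholds to one message."""
    stripped = message.strip()  # assumption: whitespace is ignored
    return {
        "near_empty": len(stripped) <= 5,      # 5 or fewer characters
        "short": len(stripped.split()) < 10,   # fewer than 10 words
    }


flags_yes = length_flags("Yes")
flags_long = length_flags("one two three four five six seven eight nine ten")
```

Short messages are often legitimate in an instruction dataset (for example a one-word classification answer), so these flags are better suited to review queues than to automatic removal.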
low Low quality samples detected
Found 1 message (0.0%) with a quality score below 0.5. Average quality score: 1.00. Consider filtering or reviewing low-quality samples before training.
1 sample text_content_quality_quality_score
low Encoding issues detected
Found 2 messages (0.0%) with potential encoding issues (e.g., mojibake, invalid characters). These may indicate data corruption or incorrect character encoding. Consider re-encoding or cleaning affected samples.
2 samples text_content_quality_has_encoding_issues
low Highly repetitive content detected
Found 135 messages (0.1%) with high repetition ratios. Repetitive content may cause the model to learn repetitive patterns. Consider reviewing or filtering these samples.
135 samples text_content_quality_has_high_repetition
low Incomplete responses detected
Found 3880 assistant responses (7.5%) with low completeness scores (below 0.5). Average completeness score: 0.90. Incomplete responses may teach the model to generate truncated or minimal outputs. Consider expanding short responses or removing low-quality samples.
3880 samples text_content_training_quality_response_completeness_score
low Placeholder text detected
Found 286 messages (0.2%) containing placeholder text (e.g., [Name], [Company Name], [Your...], [Insert...]). These indicate incomplete or template responses that should be filled in or removed before training.
286 samples text_content_content_pattern_has_placeholder
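
Placeholder text of the kind listed above can be located with a regular expression. The pattern below is a sketch built only from the four examples in the finding ([Name], [Company Name], [Your...], [Insert...]); the analyzer's real pattern list is not shown and is likely broader. The names `PLACEHOLDER_RE` and `count_placeholders` are illustrative.

```python
import re

# Matches [Name], [Company Name], and any bracketed text starting with
# "Your" or "Insert". Bracketed numbers such as [1] are not matched.
PLACEHOLDER_RE = re.compile(
    r"\[(?:name|company name|your\b[^\]]*|insert\b[^\]]*)\]",
    re.IGNORECASE,
)


def count_placeholders(text: str) -> int:
    """Count bracketed template placeholders in a message."""
    return len(PLACEHOLDER_RE.findall(text))


n = count_placeholders("Dear [Name], welcome to [Company Name]. Regards, [Your Name]")
```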
low AI hallucinated experiences detected
Found 26 messages (0.0%) containing fabricated first-person experiences (e.g., 'When I was working as a project manager...'). Training on these may cause the model to generate similar hallucinations. Consider removing or rewriting these samples.
26 samples text_content_content_pattern_has_hallucinated_experience
low Nooutput/NA markers detected
Found 38 messages (0.0%) containing nooutput markers (e.g., <nooutput>, N/A, None). These are unusable training samples and should be removed.
38 samples text_content_content_pattern_has_nooutput
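
Removing these samples can be sketched as a simple filter. The sketch assumes a marker only counts when it is the entire response after trimming whitespace, so that an ordinary sentence containing the word "none" is kept; the report says "containing", so the analyzer may match more loosely. The marker set comes from the examples in the finding, and the function names are illustrative.

```python
NOOUTPUT_MARKERS = {"<nooutput>", "n/a", "none"}


def is_nooutput(response: str) -> bool:
    """True when the whole response is a no-output marker."""
    return response.strip().lower() in NOOUTPUT_MARKERS


def drop_nooutput(responses):
    """Keep only responses that carry real content."""
    return [r for r in responses if not is_nooutput(r)]


kept = drop_nooutput(["<nooutput>", "N/A", "None of the options apply.", "Paris."])
```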
low AI refusal patterns detected
Found 10 messages (0.0%) containing AI refusal patterns (e.g., 'I cannot provide...', 'I'm unable to help...'). These indicate the model refused to complete the task. Consider reviewing or removing these samples unless training for appropriate refusals.
10 samples text_content_content_pattern_has_refusal

Distributions

10 charts

Content Pattern Placeholder Count

Content Pattern Suspicious Url Count

Content Pattern Content Pattern Score

Length Char Count (5 modes)

multimodal: 5 distinct modes detected
Mode 1 (61.6%): mean 76.5, std 31.1, count 98346
Mode 3 (20.7%): mean 158.3, std 13.4, count 31070
Mode 2 (12.6%): mean 371.5, std 100.2, count 19896
Mode 5 (4.1%): mean 697.5, std 107.4, count 5670
Mode 4 (1.0%): mean 1338.6, std 382.8, count 1024

Length Word Count (4 modes)

multimodal: 4 distinct modes detected
Mode 1 (31.3%): mean 7.8, std 3.0, count 44369
Mode 4 (50.1%): mean 19.3, std 4.7, count 84998
Mode 2 (16.6%): mean 69.5, std 24.1, count 24750
Mode 3 (2.0%): mean 188.0, std 59.2, count 1889

Length Token Count (5 modes)

multimodal: 5 distinct modes detected
Mode 1 (55.9%): mean 14.3, std 5.0, count 92322
Mode 3 (24.4%): mean 28.0, std 2.9, count 34284
Mode 4 (14.5%): mean 71.0, std 21.1, count 23431
Mode 2 (3.9%): mean 138.9, std 19.5, count 4657
Mode 5 (1.2%): mean 257.7, std 77.4, count 1312

Quality Pii Count

Quality Repetition Ratio

Quality Quality Score

Diversity Unique Words Ratio

Anomaly Detection

5 visualizations

Outliers in Content Pattern Placeholder Count

286 outliers

Outliers in Content Pattern Suspicious Url Count

1 outlier

Outliers in Content Pattern Content Pattern Score

358 outliers

Outliers in Length Char Count

3195 outliers

Outliers in Length Word Count

3119 outliers

Message Statistics

Metric Distribution Mean Std Min Max Median
text_content_placeholder_count unimodal 0.0 0.08 0.0 8.0 0.0
text_content_url_count unimodal 0.0 0.0 0.0 1.0 0.0
text_content_pattern_score unimodal 1.0 0.01 0.1 1.0 1.0
text_content_char_count multimodal (5) 161.28 181.72 0.0 4181.0 105.0
└ Mode 1 (61.6%) 98346 samples 76.51 31.09 - - -
└ Mode 3 (20.7%) 31070 samples 158.35 13.39 - - -
└ Mode 2 (12.6%) 19896 samples 371.49 100.23 - - -
└ Mode 5 (4.1%) 5670 samples 697.54 107.36 - - -
└ Mode 4 (1.0%) 1024 samples 1338.58 382.77 - - -
text_content_word_count multimodal (4) 26.05 29.75 0.0 717.0 16.0
└ Mode 1 (31.3%) 44369 samples 7.83 3.03 - - -
└ Mode 4 (50.1%) 84998 samples 19.32 4.68 - - -
└ Mode 2 (16.6%) 24750 samples 69.47 24.14 - - -
└ Mode 3 (2.0%) 1889 samples 187.98 59.21 - - -
text_content_token_count multimodal (5) 31.61 36.48 0.0 958.0 18.0
└ Mode 1 (55.9%) 92322 samples 14.33 5.04 - - -
└ Mode 3 (24.4%) 34284 samples 28.0 2.94 - - -
└ Mode 4 (14.5%) 23431 samples 70.98 21.11 - - -
└ Mode 2 (3.9%) 4657 samples 138.86 19.54 - - -
└ Mode 5 (1.2%) 1312 samples 257.67 77.4 - - -
text_content_pii_count unimodal 0.0 0.11 0.0 26.0 0.0
text_content_repetition_ratio unimodal 0.0 0.02 0.0 0.83 0.0
text_content_quality_score unimodal 1.0 0.01 0.46 1.0 1.0
text_content_words_ratio unimodal 0.88 0.11 0.0 1.0 0.88
text_content_token_ratio unimodal 0.88 0.11 0.0 1.0 0.88
text_content_vocabulary_richness unimodal 3.87 1.29 0.0 14.95 3.5
text_content_clarity_score unimodal 0.96 0.08 0.5 1.0 1.0
text_content_quality_instruction_word_count multimodal (5) 14.79 10.71 4.0 414.0 12.0
└ Mode 1 (53.0%) 30488 samples 9.29 2.35 - - -
└ Mode 3 (31.0%) 13901 samples 16.87 2.24 - - -
└ Mode 4 (12.4%) 6029 samples 26.79 4.13 - - -
└ Mode 2 (3.0%) 1390 samples 50.38 10.58 - - -
└ Mode 5 (0.6%) 194 samples 104.11 39.85 - - -
text_content_quality_response_word_count multimodal (5) 44.18 44.97 0.0 717.0 30.0
└ Mode 2 (39.5%) 21640 samples 8.22 5.1 - - -
└ Mode 4 (30.0%) 14893 samples 39.77 12.61 - - -
└ Mode 1 (21.1%) 11777 samples 81.28 13.12 - - -
└ Mode 5 (6.8%) 2809 samples 130.69 16.7 - - -
└ Mode 3 (2.6%) 883 samples 229.97 62.52 - - -
text_content_completeness_score unimodal 0.9 0.25 0.0 1.0 1.0
text_content_quality_score unimodal 1.0 0.0 0.8 1.0 1.0

Conversation Turns

Statistic Value
Count 52002
Mean 3.0
Std 0.0
Min 3
Max 3
Median 3.0
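
With exactly 3 turns in every conversation, the message total follows directly from the conversation count, and that determines which denominator each percentage in the report uses. The check below is a sketch using figures from this report; it assumes one assistant response per conversation, which is inferred from the percentages rather than stated.

```python
conversations = 52002
messages = 156006
turns_per_conversation = 3

# 52002 conversations x 3 turns each accounts for every message.
assert conversations * turns_per_conversation == messages


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place, as displayed in the report."""
    return round(100 * count / total, 1)


# Assistant-response findings are relative to conversations (assumed one each).
truncated_pct = pct(7958, conversations)
# Message-level findings are relative to all messages.
short_pct = pct(29201, messages)
```

Both values reproduce the percentages quoted in the Recommendations list for truncated responses and short messages.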

Affected Conversations